Data Exploration - Bingo Aloha

Get packages:

## Warning: package 'tweedie' was built under R version 3.6.2
## Warning: package 'ggplot2' was built under R version 3.6.2
## Warning: package 'fitdistrplus' was built under R version 3.6.2
## Warning: package 'MASS' was built under R version 3.6.2
## Warning: package 'survival' was built under R version 3.6.2
## Warning: package 'plotly' was built under R version 3.6.2
## Warning: package 'RSQLite' was built under R version 3.6.2

Load and clean data:

df_all = read.csv("/Users/PeterNovak/Desktop/Bayesian AB Testing/test_data_ba_1.csv")
df_all$user_id    = as.character(df_all$user_id)
df_all$test_group = as.character(df_all$test_group)
df_all$is_payer   = ifelse(df_all$total_spend == 0, 0, 1)
head(df_all)
df_payers = df_all[df_all$is_payer == 1, ]
head(df_payers)

Clear Problem of Distribution of Player Spend:

One reason for this problem is the presence of duplicate values in the distribution, even worse in winsorised spend as it represents a higher proportion of total player spend:

To visualise and compare to a hypothesised continuous distribution which will always assume the probability of an exact value is 0 thus the probability the same value happens twice in a finite sampled data set is 0 as well.

So how well do most distributions do? The distribution is closest to tweedie out of them:

distributions = c(rlnorm, rexp, rtweedie)
distribution_name = c("lnorm", "exp", "tweedie")
dist_obj = get_dist(df_payers$total_wins_spend, distributions, distribution_name,
                    c("plum", "dodgerblue", "darkgreen"), T, F)
## Warning in fitdist(data = input_list, method = "mle", distr = "tweedie", :
## The dtweedie function should return a vector of with NaN values when input has
## inconsistent parameters and not raise an error
## Warning in fitdist(data = input_list, method = "mle", distr = "tweedie", : The
## ptweedie function should return a zero-length vector when input has length zero
## and not raise an error

dist_obj[1:2]
## $param_est
##      power         mu        phi 
##  3.0890528 23.0561183  0.2190246 
## 
## $model_used
## [1] "tweedie"

And all criterions outperform log-normal distribution with opt. parameters. p-value of ks-test 0.07

Conclusion:

The tweedie distribution is a better fit for our data, however the presence of duplicate spend values will make any unimodal distribution a likely poor fit. However it would have to be a highly irregular distribution because of the nature of the data (explained logic below):

  1. User makes a spend by buying one of x packages with a certain value.
  1. This value is likely to be a “clean” value in the local currency (e.g. 0.99$, 1.99$, 4.99$, 9.99$, 49.99$, 99.99$)

  2. Local currency is translated to USD but those spend combination are likely to be duplicitous as well.

  3. a & b repeats when the same player makes a subsequent transaction which adds up to total spend per user.

  1. All spends per user are summed up to make a per user total spent.

  2. Daily spends are winsorized -> daily spend above certain threshold is set to threshold -> likely to create duplicitous total spend per user if user only spends on one day and spends above limit.

This process leads to dupl. spend values.

So what now?

One caveat, is assuming the data follows a exponential distribution even wrong?

"One worry might be that selecting a model after considering the data is HARKing (hypothesizing after the results are known; Kerr 1998). In that famous article, Kerr discusses why HARKing may be inadvisable. In particular, HARKing can transform Type I errors (false alarms) into confirmed hypothesis ergo fact. I.e. the non-linear trend in the population data might be a random fluke, and the better fit by the non-linear distribution might be a random fluke.

Bayesian analysis is always conditional on the assumed model space. Often the assumed model space is merely a convenient default. The default is convenient because it is familiar to both the analyst and the audience of the analysis, but the default need not be a theoretical commitment. There are also different goals for data analysis: Describing the one set of data in hand, and generalizing to the population from which the data were sampled. Various methods for penalizing overfitting of noise are aimed at finding a statistical compromise between describing the data in hand and generalizing to other data."

Further:

"Indeed, according to Kleijn, v.d Vaart (2012), in the misspecified case, the posterior distribution:

See for papers about updating prior distributions alongside prior parameters when presented with new data:

Next steps:

Investigate best fit distribution for each client data set.

Create a “population” of spenders by analysing every dataset we have so far about player spend.

Lets try it for each firm:

load_df = function(file_path) {
  df_all = read.csv(file_path)
  df_all$user_id    = as.character(df_all$user_id)
  df_all$test_group = as.character(df_all$test_group)
  df_all$is_payer   = ifelse(df_all$total_spend == 0, 0, 1)

  df_payers = df_all[df_all$is_payer == 1, ]
  return(df_payers)
}


visualise_df = function(df, name) {
  name          = gsub(" ", "\n", name)
  n_breaks      = max(min(nrow(df)/200, 40), 20)
  temp_dens     = density(df$total_wins_spend, adjust = 2)
  temp_dens_log = density(log(df$total_wins_spend), adjust = 2)
  
  hist(df$total_wins_spend,
    col    = "floralwhite",
    breaks = n_breaks,
    prob   = TRUE,
    xlab   = "Payer Winz. Spend",
    main   = "Hist.: Payer Winz Spend")
  lines(temp_dens,
    col = "firebrick4")
  legend("topright", legend = name, bty = "n")

  hist(log(df$total_wins_spend),
    col    = "floralwhite",
    breaks = n_breaks,
    prob   = TRUE,
    xlab   = "Payer log(Winz. Spend)",
    main   = "Hist.: Payer ln(Winz Spend)")
  lines(temp_dens_log,
    col = "firebrick4")
  legend("topright", legend = name, bty = "n")
}

descdist_to_list = function(descdist_obj) {
  l_out = c(descdist_obj$min,
            descdist_obj$max,
            descdist_obj$median,
            descdist_obj$mean,
            descdist_obj$sd,
            descdist_obj$skewness,
            descdist_obj$kurtosis)
  
  return(l_out)
}
df_bingo_aloha   = load_df("/Users/PeterNovak/Desktop/Bayesian AB Testing/data/data_bingo_aloha.csv")
df_homw          = load_df("/Users/PeterNovak/Desktop/Bayesian AB Testing/data/data_homw.csv")
df_idle_mafia    = load_df("/Users/PeterNovak/Desktop/Bayesian AB Testing/data/data_idle_mafia.csv")
df_spongebob     = load_df("/Users/PeterNovak/Desktop/Bayesian AB Testing/data/data_spongebob.csv")
df_terra_genesis = load_df("/Users/PeterNovak/Desktop/Bayesian AB Testing/data/data_terra_genesis.csv")
df_ultimex       = load_df("/Users/PeterNovak/Desktop/Bayesian AB Testing/data/data_ultimex.csv")
par(mar = c(4, 3, 3, 3), mgp = c(2, 0.5,0))
graphics::layout(mat = matrix(1:6, nrow = 2, ncol = 3, byrow = F),
       heights = rep(1, 6), widths = rep(1, 6))

visualise_df(df_bingo_aloha, "Bingo Aloha")
visualise_df(df_homw, "HOMW")
visualise_df(df_idle_mafia, "Idle Mafia")

visualise_df(df_spongebob, "Spongebob")
visualise_df(df_terra_genesis, "Terra Genesis")
visualise_df(df_ultimex, "Ultimate X-Poker")

n_boots = 500

dist_bingo_aloha   = descdist(df_bingo_aloha$total_wins_spend, boot = n_boots)

dist_homw          = descdist(df_homw$total_wins_spend, boot = n_boots)

dist_idle_mafia    = descdist(df_idle_mafia$total_wins_spend, boot = n_boots)

dist_spongebob     = descdist(df_spongebob$total_wins_spend, boot = n_boots)

dist_terra_genesis = descdist(df_terra_genesis$total_wins_spend, boot = n_boots)

dist_ultimex       = descdist(df_ultimex$total_wins_spend, boot = n_boots)

mat_out = matrix(data = c(descdist_to_list(dist_bingo_aloha),
                          descdist_to_list(dist_homw),
                          descdist_to_list(dist_idle_mafia),
                          descdist_to_list(dist_spongebob),
                          descdist_to_list(dist_terra_genesis),
                          descdist_to_list(dist_ultimex)),
                 nrow = 6, ncol = 7, byrow = T,
                 dimnames = list(c("Bingo Aloha",
                                   "HOMW",
                                   "Idle Mafia",
                                   "Spongebob",
                                   "Terra Genesis",
                                   "Ultimate X-Poker"),
                                 c("Min",
                                   "Max",
                                   "Median",
                                   "Mean",
                                   "SD",
                                   "Skewness",
                                   "Kurtosis")))
as.data.frame(mat_out)
distributions = c(rlnorm, rweibull, rexp)#, rtweedie)
distribution_name = c("lnorm", "weibull", "exp")#, "tweedie")
colours = c("plum", "darkgreen", "dodgerblue")#, "cyan")

fit_bingo_aloha   = get_dist(df_bingo_aloha$total_wins_spend,
                             distributions, distribution_name, colours, T, F, F)

fit_homw          = get_dist(df_homw$total_wins_spend/100,
                             distributions, distribution_name, colours, T, F, F)

fit_idle_mafia    = get_dist(df_idle_mafia$total_wins_spend/100,
                             distributions, distribution_name, colours, T, F, F)

fit_spongebob     = get_dist(df_spongebob$total_wins_spend,
                             distributions, distribution_name, colours, T, F, F)

fit_terra_genesis = get_dist(df_terra_genesis$total_wins_spend,
                             distributions, distribution_name, colours, T, F, F)

fit_ultimex       = get_dist(df_ultimex$total_wins_spend,
                             distributions, distribution_name, colours, T, F, F)

mat_out2 = matrix(data = c(fit_bingo_aloha$model_used,
                           fit_homw$model_used,
                           fit_idle_mafia$model_used,
                           fit_spongebob$model_used,
                           fit_terra_genesis$model_used,
                           fit_ultimex$model_used),
                  nrow = 6, ncol = 1, byrow = T,
                  dimnames = list(c("Bingo Aloha",
                                    "HOMW",
                                    "Idle Mafia",
                                    "Spongebob",
                                    "Terra Genesis",
                                    "Ultimate X-Poker"),
                                  c("Model Used")))
as.data.frame(mat_out2)

All get lnorm as best dist.